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Gaussian Mixture Models with Component 
Means Constrained in Pre-selected Subspaces 

Mu Qiao* Jia Li’^ 


Abstract 

We investigate a Gaussian mixture model (GMM) with component means constrained in a 
pre-selected subspace. Applications to classification and clustering are explored. An EM-type 
estimation algorithm is derived. We prove that the subspace containing the component means 
of a GMM with a common covariance matrix also contains the modes of the density and the 
class means. This motivates us to find a subspace by applying weighted principal component 
analysis to the modes of a kernel density and the class means. To circumvent the difficulty of 
deciding the kernel bandwidth, we acquire multiple subspaces from the kernel densities based 
on a sequence of bandwidths. The GMM constrained by each subspace is estimated; and the 
model yielding the maximum likelihood is chosen. A dimension reduction property is proved 
in the sense of being informative for classification or clustering. Experiments on real and 
simulated data sets are conducted to examine several ways of determining the subspace and 
to compare with the reduced rank mixture discriminant analysis (MDA). Our new method 
with the simple technique of spanning the subspace only by class means often outperforms 
the reduced rank MDA when the subspace dimension is very low, making it particularly 
appealing for visualization. 

Keywords: Gaussian mixture model, subspace constrained, modal PGA, dimension reduction, 
visualization 


1 Introduction 


The Gaussian mixture model (GMM) is a popular and effective tool for clustering and classi¬ 
fication. When applied to clustering, usually each cluster is modeled by a Gaussian distribution. 
Because the cluster labels are unknown, we face the issue of estimating a GMM. A thorough 
treatment of clustering by GMM is referred to (McLachlan and Peel, |2000a). Hastie and Tibshi- 


rani 


(1996) proposed the mixture discriminant analysis (MDA) for classihcation, which assumes 
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a GMM for each class. Fraley and Raftery (2002) examined the roles of GMM for clustering, 
classihcation, and density estimation. 

As a probability density, GMM enjoys great flexibility comparing with parametric distributions. 
Although GMM can approximate any smooth density by increasing the number of components 
R, the number of parameters in the model grows quickly with R, especially for high dimensional 
data. The regularization of GMM has been a major research topic on mixture models. Early efforts 
focused on controlling the complexity of the covariance matrices, partly driven by the frequent 

). More recently, it is 


occurrences of singular matrices in estimation (Fraley and Raftery 


noted that for data with very high dimensions, a mixture model with parsimonious covariance 
structures, for instance, common diagonal matrices, may still have high complexity due to the 
component means alone. Methods to regularize the component means have been proposed from 


quite different perspectives. Li and Zha (2006) developed the so-called two-way mixture of Poisson 
distributions in which the variables are grouped and the means of the variables in the same group 
within any component are assumed identical. The grouping of the variables reduces the number of 
parameters in the component means dramatically. In the same spirit, Qiao and Li (2010| developed 


the two-way mixture of Gaussians. Pan and Shen (2007) explored the penalized likelihood method 
with Li norm penalty on the component means. The method aims at shrinking the component 
means of some variables to a common value. Variable selection for clustering is achieved because 
the variables with common means across all the components are non-informative for cluster labels. 


Wang and Zhu (2008) proposed the Loo norm as a penalty instead. 


In this paper, we propose another approach to regularizing the component means in GMM, 


which is more along the line of reduced rank MDA (Hastie and Tibshirani, 1996) but with profound 
differences. We search for a linear subspace in which the component means reside and estimate a 
GMM under such a constraint. The constrained GMM has a dimension reduction property. It is 
proved that with the subspace restriction on the component means and under common covariance 
matrices, only a linear projection of the data with the same dimension as the subspace matters 
for classihcation and clustering. The method is especially useful for visualization when we want 
to view data in a low dimensional space which best preserves the classihcation and clustering 
characteristics. 
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The idea of restricting component means to a linear subspace was first explored in the linear 


discriminant analysis (LDA). Fisher (1936) proposed to find a subspace of rank r < K, where K 
is the number of classes, so that the projected class means are spread apart maximally, The coor¬ 
dinates of the optimal subspace are derived by successively maximizing the between-class variance 
relative to the within-class variance, known as canonical or discriminant variables. Although LDA 
does not involve the estimation of a mixture model, the marginal distribution of the observation 


without the class label is a mixture distribution. The idea of reduced rank LDA was used by Hastie 


and Tibshirani (1996) for GMM. It was proved in Hastie and Tibshirani (1996) that reduced rank 


LDA can be viewed as a Gaussian maximum likelihood solution with the restriction that the 
means of Gaussians lie in a L-dimension subspace, i.e., rank{/ifc}{^ = L < max(iF — l,p), where 


fikS are the means of Gaussians and p is the dimension of the data. Hastie and Tibshirani (1996) 
extended this concept and proposed a reduced rank version of the mixture discriminant analysis 
(MDA), which performed a reduced rank weighted LDA in each iteration of the EM algorithm. 
Another related line of research is regularizing the component means of a GMM in a latent 


factor space, i.e., the use of factor analysis in GMM. It was originally proposed by Ghahramani and 


Hinton (|1997) to perform concurrent clustering and dimension reduction using mixture of factor 


analyzers; see also McLachlan and Peel ( |2000b ) and McLachlan et ah (2003). Factor analysis was 
later used to regularize the component means of a GMM in each state of the Hidden Markov Model 


(HMM) for speaker verihcation (P. Kenny et ah, 2008; D. Povey et ah, 2011). In those models, 
the total number of parameters is significantly reduced due to the regularization, which effectively 
prevents over fitting. Usually the EM type of algorithm is applied to estimate the parameters and 
find the latent subspace. 

The role of the subspace constraining the means differs intrinsically between our approach, the 
reduced rank MDA and factor analysis based mixture models, resulting in mathematical solutions 
of quite different nature. Within each iteration of the EM algorithm for estimating a GMM, the 
reduced rank MDA finds the subspace with a given dimension that yields the maximum likelihood 
under the current partition of the data into the mixture components. The subspace depends on the 
component-based clustering of data in each iteration. Similarly, the subspaces in factor analysis 
based mixture models are found through the iterations of the EM algorithm, as part of the model 
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estimation. However, in our method, we treat the seek of the subspace and the estimation of the 
model separately. The subspace is hxed throughout the estimation of the GMM. Mathematically 
speaking, we try to solve the maximum likelihood estimation of GMM under the condition that 
the component means lie in a given subspace. 

Our formulation of the model estimation problem allows us to exploit multiple and better 
choices of density estimate when we seek the constraining subspace. For instance, if we want to 
visualize the data in a plane while the component means are not truly constrained to a plane, 
htting a GMM with means constrained to a plane may lead to poor density estimation. As a 
result, the plane sought during the estimation will be problematic. It is thus sensible to hnd the 
plane based on a density estimate without the constraint. Afterward, we can £t a GMM under the 
constraint purely for the purpose of visualization. Moreover, the subspace may be specihed based 
on prior knowledge. For instance, in multi-dimensional data visualization, we may already know 
that the component (or cluster) means of data lie in a subspace spanned by several dimensions of 
the data. Therefore, the subspace is required to be hxed. 

We propose two approaches to hnding the unknown subspace. The hrst approach is the so- 
called modal PC A (MPCA). We prove that the modes (local maxima) he in the same constrained 


subspace as the component means. We use the modal EM (MEM) algorithm (Li et ah, 2007) 
to hnd the modes. By exploiting the modes, we are no longer restricted to the GMM as a tool 
for density estimation. Instead, we use the kernel density estimate which avoids sensitivity to 
initialization. There is an issue of choosing the bandwidth, which is easier than usual in our 
framework by the following strategy. We take a sequence of subspaces based on density estimates 
resulting from diherent kernel bandwidths. We then estimate GMMs under the constraint of each 
subspace and finally choose a model yielding the maximum likelihood. Note that, each GMM is 
a full model for the original data, although the component means are constrained in a diherent 
subspace. We therefore can compare the estimated likelihood under each model. This framework 


in fact extends beyond kernel density estimation. As discussed in (Li et ah, 2007), modes can be 
found using modal EM for any density in the form of a mixture distribution. The second approach 
is an extension of MPGA which exploits class means or a union set of modes and class means. It 
is easy to see that the class means also reside in the same constrained subspace as the component 
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means. Comparing with modes, class means do not depend on the kernel bandwidth and are more 
robust to estimate. 


Experiments on the classihcation of several real data sets with moderate to high dimensions 
show that reduced rank MDA does not always have good performance. We do not intend to 
claim that our proposed method is necessarily better than reduced rank MDA. However, when the 


constraining subspace of the component means is of a very low dimension, we find that the proposed 
method with the simple technique of finding the subspace based on class means often outperforms 
the reduced rank MDA, which solves a discriminant subspace via a much more sophisticated 
approach. In addition, we compare our methods with standard MDA on the data projected to 
the subspace containing the component means. For data with moderately high dimensions, our 
proposed method works better. Besides classihcation, our method easily applies to clustering. 

The rest of the paper is organized as follows. In Section we review some background and 
notation. We present a Gaussian mixture model with subspace constrained component means, 
the MPCA algorithm and its extension for hnding the subspace in Section We also present 
several properties of the constrained subspace, with detailed proofs in the appendix. In Section 
we describe the estimation algorithm for the proposed model. Experimental results are provided 
in Section!^ Finally, we conclude and discuss future work in Section]^ 

2 Preliminaries and Notation 

Let X = (Ai, A 2 , ...jXpY, where p is the dimension of the data. A sample of X is denoted 
by a; = (xi, 0 : 2 ,..., Xp)*. We present the notations for a general Gaussian mixture model before 
introducing the mixture model with component means constrained to a given subspace. Gaussian 
mixture model can be applied to both classihcation and clustering. Let the class label of X be 
Y G = {1, 2,..., A}. For classihcation purpose, the joint distribution of X and Y under a 
Gaussian mixture is 



( 1 ) 


where is the prior probability of class k, satisfying 0 < < 1 and Yl!k=i fk{x) is the 

within-class density for X. is the number of mixture components used to model class fc, and 


5 


the total number of mixture components for all the classes is R = 'Yl!k=i ^k- Let T^kr be the mixing 
proportions for the rth component in class fc, 0 < ^kr < 1, = 1- 0(-) denotes the pdf of 

a Gaussian distribution; is the mean vector for component r in class k and S is the common 
covariance matrix shared across all the components in all the classes. To classify a sample X = x, 
the Bayes classihcation rule is used: y = argmax;i./(y = k\X = x) = axgmaXj^f^X = x,Y = k). 

In the context of clustering, the Gaussian mixture model is now simplihed as 

R 

f{X = x) = ^7rr(j){x\fi^,'S) , (2) 

r=l 


where R is the total number of mixture components and vr^ is the mixing proportions for the 
rth component, and S denote the rth component mean and the common covariance matrix 
for all the components. The clustering procedure involves hrst htting the above mixture model 
and then computing the posterior probability of each mixture component given a sample point. 
The component with the highest posterior probability is chosen for that sample point, and all the 
points belonging to the same component form one cluster. 

In this work, we assume that the Gaussian component means reside in a given linear subspace 
and estimate a GMM with subspace constrained means. A new algorithm, namely the modal 
RCA (MPCA), is proposed to hnd the constrained subspace. The motivations of using modes to 


hnd subspace are outlined in Section 3.1 Before we present MPGA, we will hrst introduce the 


modal EM algorithm (Li et ah, 2007) which solves the local maxima, that is, modes, of a mixture 
density. 

Modal EM: Given a mixture density f{X = x) = J2^=i'^rfrix), as in model (j^, starting 
from any initial data point the modal EM algorithm Ends a mode of the density by alternating 
the following two steps until a stopping criterion is met. Start with t = 0. 


1. Let pr = , r = 1,..., i?. 

2. Update ajd+i) = argmax^. log/r(aj). 


The above two steps are similar to the expectation and the maximization steps in EM (Demp¬ 


ster et ah, 1977). The hrst step is the “expectation” step where the posterior probability of each 
mixture component r, 1 < r < i?, at the current data point ajd) jg computed. The second step 
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is the “maximization” step. '^^^iPr^og fr{x) has a unique maximum, if the fr{x)^s are normal 
densities. In the special case of a mixture of Gaussians with common covariance matrix, that is, 
fr{x) = (f){x I Hj,, S), we simply have = J2r=iPrl-‘'r- uiodal EM, the probability density 

function of the data is estimated nonparametrically using Gaussian kernels, which are in the form 
of a Gaussian mixture distribution: 

n 

f[x = x) = Y^ -(j){x I Xi, S), 

^ n 

i=l 

where the Gaussian density function is 


(f){x I Xi, S) = 




We use a spherical covariance matrix S = diag{a‘^,a‘^, The standard deviation a is also 

referred to as the bandwidth of the Gaussian kernel. When the bandwidth of Gaussian kernels 
increases, the density estimate becomes smoother, and more data points tend to ascend to the 
same mode. Different numbers of modes can thus be found by gradually increasing the bandwidth 
of Gaussian kernels. The data points are grouped into one cluster if they climb to the same mode. 
We call the mode as the cluster representative. 


In (Li et ah, 2007), a hierarchical clustering approach, namely. Hierarchical Mode Association 
Clustering (HMAC), is proposed based on mode association and kernel bandwidth growth. Given 
a sequence of bandwidths Ui < (72 < ■ ■ ■ < a,,, HMAG starts with every point Xi being a cluster by 
itself, which corresponds to the extreme case that a\ approaches 0. At any bandwidth ai{l > 1), 
the modes, that is, cluster representatives, obtained from the preceding bandwidth are input into 
the modal EM algorithm. The modes identihed then form a new set of cluster representatives. 


This procedure is repeated across all (j/’s. For details of HMAG, we refer interested readers to (Li 


et ah, 2007). We therefore obtain modes at different levels of bandwidth by HMAG. The clustering 


performed by HMAG is only for the purpose of Ending modes across diEerent bandwidths and 
should not be confused with the clustering or classification based on the Gaussian mixture model 
we propose here. 
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3 GMM with Subspace Constrained Means 


The Gaussian mixture model with subspace constrained means is presented in this section. 
For brevity, we focus on the constrained mixture model in a classification set-up, since clustering 
can be treated as a “one-class” modeling and is likewise solved. 

We propose to model the within-class density by a Gaussian mixture with component means 
constrained to a pre-selected subspace: 

Rk 

fk{x) = S) (3) 

r=l 

v] ■ Mfcl = ^lk2 = ■■■ = v] ■ ^lkR, = Cj , (4) 

where n^-’s are linearly independent vectors, j = l,...,g, q < p, and Cj is a constant, invariant 
to different classes. Without loss of generality, we can assume {ni,...,^^} span an orthonormal 
basis. Augment it to full rank by {ug+i,..., Vp}. Suppose ly = {ug+i,..., Vp}, = {ui,..., Vg}, and 
c = (ci, C 2 ,..., Cq)h Denote the projection of a vector ^ or a matrix U onto an orthonormal basis 
S by Proj|^ or Proj^. We have Proj|^*’' = c over all the k and r. That is, the projections of all 
the component means o^^o the subspace coincide at c. We refer to u as the constrained 
sub space where reside (or more strictly, reside in the subspace up to a translation), and 

as the corresponding null subspace. Suppose the dimension of the constrained subspace h> is d, 
then d = p — q. With the constraint (|^ and the assumption of a common covariance matrix across 
all the components in all the classes, essentially, we assume that the data within each component 
have identical distributions in the null space In the following section, we will explain how to 
hnd an appropriate constrained subspace u. 

3.1 Modal PCA 

We introduce in this section the modal PGA (MPGA) algorithm that hnds a constrained 
subspace for the component means of a Gaussian mixture and the properties of the found subspace. 
We prove in Appendix A the following theorem. 

Theorem 3.1 For a Gaussian mixture model with component means constrained in a subspace 
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V = {Vq+1, ...,Vp}, q < p, and a common covariance matrix across all the components, the modes 
of the mixture density are also constrained in the same subspace u. 


According to Theorem |3.1[ the modes and component means of Ganssian mixtnres reside in the 
same constrained snbspace. We nse the aforementioned MEM algorithm introdnced in Section|^to 
hnd the modes of the density. To avoid sensitivity to initialization and the nnmber of components, 
we nse the Ganssian kernel density estimate instead of a hnite mixtnre model for the density. It 
is well known that mixtnre distributions with drastically different parameters may yield similar 
densities. We are thus motivated to exploit modes which are geometric characteristics of the 
densities. 

Let us denote the set of modes found by MEM under the kernel bandwidth ex by ^ = 
■M. 17 , 2-1 ■■■! M.a,\g\}- A weighted principal component analysis is proposed to hnd the con¬ 
strained subspace. A weight is assigned to the rth mode, which is the proportion of sample 
points in the entire data ascending to that mode. We therefore have a weighted covariance matrix 
of all the modes in Q\ 

\G\ 




r=l 


where pg = The principal components are then obtained by performing an 

eigenvalue decomposition on Eg. Recall the dimension of the constrained subspace u is d. Since 
the leading principal components capture the most variation in the data, we use the hrst d most 
signiheant principal components to span the constrained subspace jz, and the remaining principal 
components to span the corresponding null space 

Given a sequence of bandwidths ai < a 2 < ■ ■ ■ < cr^, the modes at different levels of bandwidth 
can be obtained using the HMAG algorithm introduced in Section At each level, we apply 
the weighted PGA to the modes, and obtain a new constrained subspace by their hrst d most 
signiheant principal components. In practice, if the number of modes found at a particular level 
of bandwidth is smaller than 3, we will skip the modes at that level. For the extreme case, when 
cr = 0, the subspace is actually spanned by the principal components of the original data points. 
We therefore obtain a collection of subspaces, i^i, ..., 1 /^, resulting from a sequence of bandwidths 
through HMAG. 
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3.2 Extension of Modal PCA 


In this section, we propose another approach to generating the constrained subspace, which is 
an extension of MPCA. Suppose the mean of class k is Ai'k, we have Ai'k = where 

is the rth component in class k. It is easy to see that the class means lie in the same subspace as 


the Gaussian mixture component means. From Theorem 3.1, we know that in Gaussian mixtures, 
the modes and component means also reside in the same constrained subspace. So the class means, 
modes and component means all he in the same constrained subspace. Gomparing with the modes, 
class means are more robust to estimate. It is thus natural to incorporate class means to find the 
subspace. In the new approach, if the dimension d of the constrained subspace is smaller than K, 
the subspace is spanned by applying weighted PGA only to class means. Otherwise, it is spanned 
by applying weighted PGA to a union set of modes and class means. 

Similar to modal PGA, we hrst assign a weight to the /cth class mean Ad'fc, which is the 
proportion of the number of sample points in class k over the entire data, i.e., the prior probability 
of class k. Suppose the set of class means is JT” = M' 2 , ■ ■ ■ If d < K, we have a 

weighted covariance matrix of all the class means: 

K 

'Ej = ^ ak{M'r — ^ijY{M'r — /iy) , 

r=l 


where /iy = o.k-M'k- An eigenvalue decomposition on Ej is then performed to obtain all the 
principal components. Similar to MPGA, the constrained subspace is spanned by the hrst d most 
signihcant principal components. If d > K, we will put together all the class means and modes 
and assign different weights to them. Suppose 7 is a value between 0 and 100, we allocate a total 
of 7 % of weight to the class means, and the remaining (100 — 7 )% weights allocated proportionally 
to the modes. That is, the weights assigned to the class mean Ai'k and the mode Aia,r are ■yak% 
and (100 — 7)^0-,r%, respectively. Then the weighted covariance matrix of the union set of class 

means and modes becomes 

K \g\ 

^guj = — hy) + ~ l)u!a,r%{Aia,r — /^g)^(Alcr,r' “ P-g) • 

r=l r=l 


Different weights can be allocated to the class means and the modes. For instance, if we want the 
class means to play a more important role in spanning subspaces, we can set 7 > 50. Again, an 
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eigenvalue decomposition is performed on to obtain all the principal components and the 

hrst d most signihcant principal components span the constrained subspace. To differentiate this 
method from MPCA, we denote it by MPCA-MEAN. 


3.3 Dimension Reduction 

The mixture model with component means under constraint Q implies a dimension reduction 
property for the classihcation purpose, formally stated below. 


Theorem 3.2 For a Gaussian mixture model with a common covariance matrix S, suppose all 
the component mean ’s are constrained in a subspace spanned by u = {r^g+i,..., Vp], q < p, up 
to a translation, only a linear projection of the data x onto a subspace spanned by {H~^Vj\j = 
g + 1, ...,p} (the same dimension as u) is informative for classification. 


In Appendix B, we provide the detailed proof for Theorem 3.2 If the common covariance matrix 
S is an identity matrix (or a scalar matrix), the class label Y only depends on the projection of 
X onto the constrained subspace ly. However, in general, S is non-identity. Hence the spanning 
vectors, j = q + 1, ...,p, for the subspace informative for classihcation are not orthogonal 

in general as well. In Appendix B, we use the column vectors of orth{{'S~^Vj\j = q + l,...,p}) 
to span this subspace. To differentiate it from the constrained subspace in which the component 
means he, we call it as discriminant subspace. The dimension of the discriminant subspace is 
referred to as discriminant dimension, which is the dimension actually needed for classihcation. 
The discriminant subspace is of the same dimension as the constrained subspace. When the 
discriminant dimension is small, signihcant dimension reduction is achieved. Our method can 
thus be used as a data reduction tool for visualization when we want to view the classihcation of 
data in a two or three dimensional space. 


Although in Appendix B we prove Theorem 3^ in the context of classihcation, the proof 
can be easily modihed to show that the dimension reduction property applies to clustering as 
well. That is, we only need the data projected onto a subspace with the same dimension as the 
constrained subspace u to compute the posterior probability of the data belonging to a component 
(aka cluster). Similarly, we name the subspace that matters for clustering as discriminant subspace 
and its dimension as discriminant dimension. 
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4 Model Estimation 


We will first describe in Section 4A the basic version of the estimation algorithm where the 
constraints on the component means are characterized by (|^. A natnral extension to the con¬ 
straint in (4) is to allow the constant Cj to vary with the class labels, thus leading to constraint 


characterized in (10). The corresponding algorithm is described in Section 4.2 


4.1 The Algorithm 

Let us hrst summarize the work flow of our proposed method: 


1. Given a sequence of kernel bandwidths ai < a 2 < ■ ■ ■ < Cr,, apply HMAC to hnd the modes 
of the density estimation at each bandwidth ai. 

2. Apply MPCA or MPCA-MEAN to the modes or a union set of modes and class means at 
each kernel bandwidth and obtain a sequence of constrained subspaces. 

3. Estimate the Gaussian mixture model with component means constrained in each subspace 
and select the model yielding the maximum likelihood. 

4. Perform classihcation on the test data or clustering on the overall data, with the selected 
model from Step 3. 


Remarks: 


1. In our method, the seek of subspace and the estimation of the mixture model are separate. 
We hrst search for a sequence of subspaces and then estimate the model constrained in each 
subspace separately. 

2. In Step 1, the identihed modes are from the density estimation of the overall data (in 
clustering) or the overall training data (in classihcation). 

3. For MPGA-MEAN, if the dimension d of the constrained subspace is smaller than K, the 
subspace is spanned only by class means and is therefore hxed. We do not need to choose 
the subspace. 

4. Some prior knowledge may be exploited to yield an appropriate subspace. Then, we can 
estimate GMM under the constraint of the given subspace directly. 
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Now we will derive an EM algorithm to estimate a GMM under the constraint of a given 
subspace. The estimation method for classihcation is introduced Erst. A common covariance 
matrix S is assumed across all the components in all the classes. In class k, the parameters to 
be estimated include the class prior probability a^, the mixture component prior probabilities Tr^r, 
and the Gaussian parameters S, r = 1,2, ...,Rk- Denote the training data by {{xi,yi) : i = 
1, ...,n}. Let Ufc be the number of data points in class k. The total number of data points n is 
'n!k=i^k- The class prior probability ak is estimated by the empirical frequency Uk/ Yl!k'=i^k' ■ 
The EM algorithm comprises the following two steps: 

1. Expectation-step-. Given the current parameters, for each class k, compute the component 
posteriori probability for each data point Xi within class fc: 

Rk 

qi,kr OC 'Kkr(t){Xi\pL,,^, S) , Subjcct tO E Qi,kr 1 • (5) 

2. Maximization-step-. Update Tr^r, f-ikr^ S, which maximize the following objective function 
(the i subscript indicates Xi with i/i = k): 

K Rk / n^. \ K Rk rik 

EE E Qi,kr j log 71 kr + EEE qi,kr'^Og(j){Xi\Hf,r,'^) (6) 

k=l r=l V*=l / fc=l r=l i=l 

under the constraint (|^. 

In the maximization step, the optimal vr^r’s are not affected by the constraint (|^ and are 
solved separately from Ukr^ and S: 

rik Rk 

Tlkr OC ^ ^ qi,kr ) ^ ^ '^kr 1 • (7) 

i=l r=l 

Since there are no analytic solutions to A^^^’s and S in the above constrained optimization, we 
adopt the generalized EM (GEM) algorithm. Specifically, we use a conditional maximization 
approach. In every maximization step of GEM, we first fix S, and then update the f-ikrS. Then 
we update S conditioned on the At^^’s held fixed. This iteration will be repeated multiple times. 

Given S, solving is non-trivial. The key steps are summarized here. For detailed derivation, 
we refer interested readers to Appendix G. In constraint (j^, we have n* ■ = Cj, i.e., identical 

across all the k and r for j = l,...,g. It is easy to see that c = (ci,...,Cq)* is equal to the 


13 


projection of the mean of the overall data onto the null space However, in practice, we do 
not need the value of c in the parameter estimation. Before we give the equation to solve 
let us dehne a few notations hrst. Assume S is non-singular and hence positive dehnite, we can 
write S = (S2)*(S2), where S 2 is of full rank. If the eigen decomposition of S is S = VsDsV^, 
then S 2 = D^V-^. Let Vnuii be a p x g orthonormal matrix {vi, ...,Vq), the column vectors of 
which span the null space Suppose B = Ti^Vnuii- Perform a singular value decomposition 
(SVD) on B, i.e., B = UbDbVb\ where Ub is a p x g matrix, the column vectors of which 
form an orthonormal basis for the space spanned by the column vectors of B. Let C/ be a column 
augmented orthonormal matrix of Ub- Denote Yll=i%kr by hr- Let Xkr = J2i=iQi,kr^i/^kr, i-e., 
the weighted sample mean of the component r in class k, and x^r = U . x^r- Dehne 

by the following Eqs. and ([^: 


1 . for the hrst g coordinates, j = 1,..., g: 

E K 1 w 

k'=l 2-^r'=l 




n 


, identical over r and k ; 


2 . for the remaining p — q coordinates, j = g -|- 1, ...,p: 


P'krJ ~ ^kr,j 


( 8 ) 

( 9 ) 


That is, the hrst g constrained coordinates are optimized using component-pooled sample mean 
(components from all the classes) while those p — g unconstrained coordinates are optimized 
separately within each component using the component-wise sample mean. Note that we abuse 
the term “sample mean” here to mean Xkr, instead of Xkr'- In the maximization step, the parameter 
is hnally solved by: 

fi,, = (.^iyuai ■ 

Given the ^^krS, it is easy to solve S: 

5. _ Ef=l ESi Er=l (liMXi - UkrYi^i - t^kr) 

n 

To initialize the estimation algorithm, we hrst choose Rk-, the number of mixture components 
for each class k. For simplicity, an equal number of components are assigned to each class. The 
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constrained model is initialized by the estimated parameters from a standard Gaussian mixture 
model with the same number of components. 

We have so far discussed the model estimation in a classihcation set-up. We assume a common 
covariance matrix and a common constrained subspace for all the components in all the classes. 
Similar parameter estimations can also be applied to the clustering model. Specihcally, all the 
data are put in one “class”. In this “one-class” estimation problem, all the parameters can be 
estimated likewise, by omitting the “fc” subscript for classes. For brevity, we skip the details here. 


4.2 Variation of the Algorithm 

We have introduced the Gaussian mixture model with component means from different classes 
constrained in the same subspace. It is natural to modify the previous constraint in Q to 

^5 ■ ^kl = v]- ^lk2 = ■■■ = V] ■ flkR, = Ck,j , ( 10 ) 


where v/s are linearly independent vectors spanning an orthonormal basis, j = 1,..., g, q < p, and 
Cfcj depends on class k. That is, the projections of all the component means within class k onto 
the null space coincide at the constant c*,, where = (c^^, Ckp---, Ck^qY- In the new constraint 
(10), {vi, ■■■iVq] is the same set of vectors as used in constraint (|^, which spans the null space 
Because Ck varies with class k, the subspace in which the component means from each class 
reside differs from each other by a translation, that is, these subspaces are parallel. 

We train a constrained model for each class separately, and assume a common covariance 


matrix across all the components in all the classes. In the new constraint (10), Ck is actually 
equal to the projection of the class mean Ai'k onto the null space Similar to the previous 
estimation, in practice, we do not need the value of Ck in the parameter estimation. With the 


constraint (10), essentially, the component means in each class are now constrained in a shifted 
subspace parallel to ly. The shifting of subspace for each class is determined by c^., or the class 
mean Ai'k- Suppose the dimension of the constrained subspace is d. In general, the dimension 
that matters for classihcation in this variation of the algorithm is d + K — 1, assuming that the 
class means already span a subspace of dimension K — 1. 

We hrst subtract the class specihc means from the data in the training set, that is, do a class 
specihc centering of the data. Similarly as the algorithm outlined in Section we put all the 


15 





centered data from all the classes into one training set, find all the modes under different kernel 
bandwidths, and then apply MPCA to generate a sequence of constrained subspaces. The reason 
that we remove the class specihc means hrst is that they have already played a role in spanning the 
subspace containing all the component means. When applying MPCA, we only want to capture 
the dominant directions for the variation within the classes. 

Comparing with the parameter estimation in Section the only change that we need to make 
is that the constrained q coordinates in are now identical over r, but not over class k. For the 
hrst q coordinates, j = 1,..., g, we have: 




Xjr'=l ^kr'^kr', 

nk 


identical over r in class k . 


That is, the hrst q constrained coordinates are optimized using component-pooled sample mean 
in class k. All the other equations in the estimation remain the same. 


5 Experiments 

In this section, we present experimental results on several real and simulated data sets. The 
mixture model with subspace constrained means, reduced rank MDA, and standard MDA on the 
projection of data onto the constrained subspace, are compared for the classihcation of real data 
with moderate to high dimensions. We also visualize and compare the clustering results of our 
proposed method and the reduced rank MDA on several simulated data sets. 

The detailed methods tested in the experiments and their name abbreviations are summarized 
as follows: 


GMM-MPCA The mixture model with subspace constrained means, in which the sub¬ 
space is obtained by MPCA. 

GMM-MPCA-MEAN The mixture model with subspace constrained means, in which 


the subspace is obtained by MPCA-MEAN, as introduced in Section 3.2 


GMM-MPCA-SEP The mixture model with component means constrained by separately 


shifted subspace for each class, as introduced in Section 4.1 


MDA-RR The reduced rank mixture discriminant analysis (MDA), which is a weighted 
rank reduction of the full MDA. 
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MDA-RR-OS The reduced rank mixture discriminant analysis (MDA), which is based 


on optimal scoring (Hastie and Tibshirani, 1996), a multiple linear regression approach 


MDA-DR-MPCA The standard MDA on the projection of data onto the same con¬ 
strained subspace selected by GMM-MPCA. 

MDA-DR-MPCA-MEAN The standard MDA on the projection of data onto the same 
constrained subspace selected by GMM-MPCA-MEAN. 


Remarks: 


1 . 

2 . 


Since the most relevant work to our proposed method is reduced rank mixture discriminant 


analysis (MDA), we briefly introduce MDA-RR and MDA-RR-OS in Section 5.1 


In MDA-DR-MPGA or MDA-DR-MPGA-MEAN, the data are projected onto the con¬ 
strained subspace which has yielded the largest training likelihood in GMM-MPGA or GMM- 
MPGA-MEAN. Note that this constrained subspace is spanned by = {ug+i,..., Vp}, which 
is found by MPGA or MPGA-MEAN, rather than the discriminant subspace informative for 
classihcation. We then apply standard MDA (assume a common covariance matrix across all 
the components in all the classes) to the projected training data, and classify the test data 
projected onto the same subspace. Note that, if we project the data onto the discriminant 
subspace spanned by {'E~^Vj\j = q + 1, and then apply standard MDA to classihca¬ 

tion, it is theoretically equivalent to GMM-MPGA or GMM-MPGA-MEAN (ignoring the 
variation caused by model estimation). The reason that we conduct these comparisons is 
multi-fold: hrst, we want to see if there is advantage of the proposed method as compared 
to a relative naive dimension reduction scheme; second, when the dimension of the data is 
high, we want to investigate if the proposed method has robust estimation of S; third, we 
want to investigate the difference between the constrained subspace and the discriminant 
subspace. 


5.1 Reduced Rank Mixture Discriminant Analysis 

Reduced rank MDA is a data reduction method which allows us to have a low dimensional 
view on the classihcation of data in a discriminant subspace, by controlling the within-class spread 
of component means relative to the between class spread. We outline its estimation method in 
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Appendix D, which is a weighted rank reduction of the full mixture solution proposed by Hastie 


and Tibshirani (1996). We also show how to obtain the discriminant subspace of the reduced rank 


method in Appendix D. 


Hastie and Tibshirani 

(1996 

) applied the optimal scoring approach ( 

Breiman and Ihaka, 

1984 


to £t reduced rank MDA, which converted the discriminant analysis to a nonparametric multiple 
linear regression problem. By expressing the problem as a multiple regression, the fitting proce¬ 


dures can be generalized using more sophisticated regression methods than linear regression (Hastie 


and Tibshirani, 1996), for instance, flexible discriminant analysis (FDA) and penalized discrim¬ 


inant analysis (PDA). The use of optimal scoring also has some computational advantages, for 
instance, using fewer observations than the weighted rank reduction. A software package contain¬ 


ing a set of functions to £t MDA, FDA, and PDA by multiple regressions is provided by Hastie 


and Tibshirani (1996). 


Although the above benehts for estimating reduced rank MDA are gained from the optimal 
scoring approach, there are also some restrictions. For instance, it can not be easily extended to 
fit a mixture model for clustering since the component means and covariance are not estimated 
explicitly. In addition, when the dimension of the data is larger than the sample size, optimal 
scaling can not be used due to the lack of degrees of freedom in regression. In the following 
experiment section, we will compare our proposed methods with reduced ra nk MDA. Both our 
own implementation of reduced rank MDA based on weighted rank reduction of the full mixture, 
i.e., MDA-RR, and the implementation via optimal scoring from the software package provided 


by Hastie and Tibshirani (1996), i.e., MDA-RR-OS, are tested. 


5.2 Classification 

Eight data sets from various sources are used for classihcation. We summarize the detailed 
information of these data below. 


• The sonar data set consists of 208 patterns of sonar signals. Each pattern has 60 dimensions 
and the number of classes is two. The sample sizes of the two classes are (111, 97). 

• The robot data set has 5456 navigation instances, with 24 dimensions and four classes (826, 
2097, 2205, 328). 
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The waveform data (Hastie et ah, 2001) is a simulated three-classes data of 21 features, 


with a waveform function generating both training and test sets (300, 500). 


The imagery semantics data set (Qiao and Li, 2010) contains 1400 images each represented 


by a 64 dimensional feature vector. These 1400 images come from hve classes with different 
semantics (300, 300, 300, 300, 200). 

The parkinsons data set is composed of 195 individual voice recordings, which are of 21 
dimensions and divided into two classes (147, 48). 

The satellite data set consists of 6435 instances which are square neighborhoods of pixels, 
with 36 dimensions and six classes (1533, 703, 1358, 626, 707, 1508). 

The semeion handwritten digit data have 1593 binary images from ten classes (0-9 digits) 
with roughly equal sample size in each class. Each image is of 16 x 16 pixels and thus has 
256 dimensions. Four fifths of the images are randomly selected to form a training set and 
the remaining as testing. 


The yaleB face image data (Georghiades et ah, 2001 Lee et ah, 2005 He et ah, 2005) 


contains gray scale human face images for 38 individuals. Each individual has 64 images, 
which are of 32 x 32 pixels, normalized to unit vectors. We randomly select the images of 
five individuals, and form a data set of 250 training images and 70 test images, with equal 
sample size for each individual. 

The sonar, robot, parkinsons, satellite and semeion data are from the UCI machine learning 
repository. Among the above data sets, the semeion and yaleB data have high dimensions. The 
other data sets are of moderately high dimensions. 

For the data sets with moderately high dimensions, five-fold cross validation is used to compute 
their classification accuracy except for the waveform, whose accuracy is the average over ten 


simulations, the same setting used in (Hastie et ah, 2001). We assume a full common covariance 


matrix across all the components in all the classes. For the semeion and yaleB data sets, the 
randomly split training and test samples are used to compute their classification accuracy instead 
of cross validation due to the high computational cost. Since these two data sets are of high 
dimensions, for all the tested methods, we assume common diagonal covariance matrices across 
all the components in all the classes. For simplicity, the same number of mixture components is 
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used to model each class for all the methods. 


In our proposed methods, the constrained subspaces are found by MPCA or MPCA-MEAN, 
introduced in Section |3.1| and |3.2[ Specihcally, in MPCA, a sequence of subspaces are identihed 
from the training data by gradually increasing the kernel bandwidth u;, i.e., cxi < (J 2 < • • • < 

I = 1,2,...,7]. In practice, we set rj = 20 and choose ads equally spaced from [O.ld, 2d], where 
d is the largest sample standard deviation of all the dimensions in the data. HMAC is used to 
obtain the modes at different bandwidths. Note that in HMAC, some ai may result in the same 
clustering as a;_i, indicating that the bandwidth needs to be increased substantially so that the 
clustering result will be changed. In our experiments, only the modes at the bandwidth resulting 
in different clustering from the preceding bandwidth are employed to span the subspace. For 
the high dimensional data, since the previous kernel bandwidth range [O.ld, 2d] does not yield 
a sequence of distinguishable subspaces, we therefore increase their bandwidths. Specihcally, for 
the semeion and yaleB data, the kernel bandwidth ai is now chosen equally spaced from [4d, 5d] 
and [2d, 3d], respectively, with the interval being O.ld. In GMM-MPCA-SEP, since the modes 
are identihed from a new set of class mean removed data, for both the semeion and yaleB data, 
the kernel bandwidth ai is now chosen equally spaced from [3.1d, 5d], with the interval being 
O.ld. For the other data sets, ai is still chosen equally spaced from [O.ld,2d]. In MPCA-MEAN, 
if the dimension of the constrained subspace is smaller than the class number K, the subspace 
is obtained by applying weighted PCA only to class means. Otherwise, at each bandwidth, we 
obtain the subspace by applying weighted PCA to a union set of class means and modes, with 
60% weight allocated proportionally to the means and 40% to the modes, that is, 7 = 60. The 
subspace yielding the largest likelihood on the training data is hnally chosen as the constrained 
subspace. 


5.2.1 Classification Results 


We show the classihcation results of the tested methods in this section. The classihcation 
error rates on data sets of moderately high dimensions are shown in Tables Bi and We 
vary the discriminant dimension d and also the number of mixture components used for modeling 
each class. Similarly, Table shows the classihcation error rates on the semeion and yaleB data, 
which are of high dimensions. For all the methods except GMM-MPCA-SEP, the dimension of 


20 






the discriminant subspace equals the dimension of the constrained subspace, denoted by d. For 
GMM-MPCA-SEP, the dimension of the discriminant space is actually K — 1 + d. In order to 
compare on a common ground, for GMM-MPCA-SEP, we change the notation for the dimension 
of the constrained subspace to d', and still denote the dimension of the discriminant subspace by 
d = K — 1 + d'. The minimum number of dimensions used for classihcation in GMM-MPCA-SEP 
is therefore K — 1. In all these tables, if d is set to be smaller than iP — 1, we do not have the 
classihcation results of GMM-MPCA-SEP, which are marked by “NA”. In addition, in Table 4(b)[ 
the classihcation error rates of MDA-RR-OS on yaleB data are not reported since the dimension 
p of the data is signihcantly larger than the sample size n. The reduced rank MDA based on 
optimal scoring approach cannot be applied due to the lack of degree freedom in the regression 
step for the small n large p problem. The minimum error rate in each column is in bold font. 
From the results in these tables, we summarize our hndings as follows: 


• Comparing the three Gaussian mixture models with subspace constrained means, GMM- 
MPCA-MEAN and GMM-MPCA-SEP usually outperform GMM-MPCA, except on the 
waveform data. Since the class means are involved in spanning the constrained subspace in 
GMM-MPCA-MEAN and determine the shifting of the subspace for each class in GMM- 
MPCA-SEP, the observed advantage of GMM-MPCA-MEAN and GMM-MPCA-SEP indi¬ 
cates that class means are valuable for hnding a good subspace. 

• Comparing the proposed methods and the reduced rank MDA methods, when the discrimi¬ 
nant dimension is low, GMM-MPCA-MEAN and GMM-MPCA-SEP usually perform better 
than MDA-RR and MDA-RR-OS. When the discriminant dimension becomes higher, we do 
not observe a clear winner among different methods. The results are very data-dependent. 
Note that in GMM-MPCA-MEAN, when the discriminant dimension is smaller than iP — 1 , 
the subspace is obtained by applying weighted PGA only to the class means. For most data 
sets, when the discriminant dimension is very low, GMM-MPCA-MEAN performs best or 
close to best. 

• Comparing the proposed methods and the simple methods of hnding the subspace hrst and 
then htting MDA on the data projected onto the subspace, when the data dimension is 
moderately high and the discriminant dimension is very low, GMM-MPCA/GMM-MPCA- 
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MEAN usually perform better than MDA-DR-MPCA/MDA-DR-MPCA-MEAN. As the dis¬ 
criminant dimension increases, with certain component numbers, MDA-DR-MPCA/MDA- 
DR-MPCA-MEAN may have a better classihcation accuracy. In addition, if the data di¬ 
mension is high, for instance, the yaleB data, MDA-DR-MPCA/MDA-DR-MPCA-MEAN 
may perform better even at lower discriminant dimension. As discussed in Remark 2 of this 
section, for MDA-DR-MPCA/MDA-DR-MPCA-MEAN and GMM-MPCA/GMM-MPCA- 
MEAN, we essentially do classihcation on the data in two different subspaces, i.e., the con¬ 
strained subspace and the discriminant subspace. For GMM-MPGA/GMM-MPGA-MEAN, 
under the subspace constraint, we need to estimate a common covariance matrix, which 


affects the discriminant subspace, as shown in Section |3.3[ Generally speaking, when the 
discriminant dimension becomes higher or the data dimension is high, it becomes more dif- 
hcult to accurately estimate the covariance matrix. For instance, for the high dimensional 
data, we assume a common diagonal covariance matrix, so that the covariance estimation be¬ 
comes feasible and avoids singularity issue. However, this may result in a poor discriminant 
subspace, which leads to worse classihcation accuracy. On the other hand, when the data 
dimension is moderately high and the discriminant dimension is very low, the estimated co- 
variance matrix is more accurate and the discriminant subspace informative for classihcation 
is empirically better than the constrained subspace. 

As a hnal note, when the discriminant dimension is low, MDA-DR-MPGA-MEAN generally 
outperforms MDA-DR-MPGA. 


5.3 Sensitivity of Subspace to Bandwidths 

Diherent kernel bandwidths may result in diherent sets of modes by HMAG, which again may 
yield diherent constrained subspaces. We investigate in this section the sensitivity of constrained 
subspaces to kernel bandwidths. 

Assume two subspaces vi and 1^2 are spanned by two sets of orthonormal basis vectors 
..., and ..., where d is the dimension. To measure the closeness between 

two subspaces, we project the basis of one subspace onto the other. Specihcally, the closeness 
between i^i and 1^2 is dehned as doseness{ui, U 2 ) = ^2 span 
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Table 1: Classification error rates (%) for the data with moderately high dimensions (I) 


(a) Robots data 


1 Num of components 

d = 2 

d = 5 

d^7 

d = 9 

d = 11 

d = 13 

d = 15 

d = 17 


GMM-MPCA 

41.39 

35.78 

31.93 

31.73 

31.25 

31.54 

31.65 

31.60 


GMM-MPCA-MEAN 

30.32 

32.06 

30.11 

30.52 

31.19 

31.40 

31.69 

31.29 

3 

GMM-MPCA-SEP 

NA 

30.86 

30.68 

30.42 

29.58 

30.28 

30.97 

30.68 


MDA-RR 

41.22 

30.32 

30.85 

30.57 

29.95 

29.95 

29.95 

29.95 


MDA-RR-OS 

40.16 

32.73 

32.44 

30.35 

30.66 

30.43 

30.26 

31.01 


MDA-DR-MPCA 

44.10 

40.30 

35.04 

32.72 

33.03 

33.56 

33.39 

32.84 


MDA-DR-MPCA-MEAN 

41.22 

36.42 

34.71 

33.83 

32.75 

32.44 

33.47 

32.26 


GMM-MPCA 

40.74 

31.91 

30.77 

30.15 

29.40 

29.05 

27.91 

28.24 


GMM-MPCA-MEAN 

26.56 

31.49 

29.71 

29.98 

28.43 

28.02 

28.12 

28.39 

4 

GMM-MPCA-SEP 

NA 

31.41 

30.19 

28.45 

28.92 

30.13 

29.54 

30.39 


MDA-RR 

40.45 

33.63 

30.41 

28.28 

27.77 

27.09 

27.18 

27.18 


MDA-RR-OS 

40.91 

31.87 

31.36 

30.24 

27.88 

29.01 

28.59 

28.61 


MDA-DR-MPCA 

42.26 

36.16 

34.64 

31.95 

30.06 

28.90 

29.77 

27.56 


MDA-DR-MPCA-MEAN 

39.41 

34.38 

34.53 

31.96 

29.73 

29.45 

28.15 

28.28 


GMM-MPCA 

37.72 

29.67 

29.25 

29.31 

27.86 

27.91 

26.28 

26.21 


GMM-MPCA-MEAN 

28.72 

27.86 

26.98 

26.69 

26.83 

25.90 

26.37 

26.14 

5 

GMM-MPCA-SEP 

NA 

26.48 

27.05 

27.46 

27.22 

26.76 

26.74 

27.00 


MDA-RR 

40.39 

29.01 

26.52 

26.08 

26.03 

26.61 

26.52 

27.09 


MDA-RR-OS 

39.96 

30.99 

29.38 

28.24 

28.48 

27.59 

28.24 

27.57 


MDA-DR-MPCA 

41.07 

35.69 

32.44 

30.86 

29.03 

27.99 

28.52 

26.34 


MDA-DR-MPCA-MEAN 

38.34 

33.10 

32.18 

30.13 

28.56 

26.70 

27.05 

26.28 


(b) Waveform data 


1 Num of components 

d = 2 

d = 4 

d = 6 

d = 8 

d = 10 

d = 12 

d = 14 

d = 16 


GMM-MPCA 

15.70 

15.64 

16.12 

17.10 

17.76 

17.80 

18.24 

18.64 


GMM-MPCA-MEAN 

16.12 

16.14 

16.82 

17.38 

17.76 

17.92 

17.90 

18.84 

3 

GMM-MPCA-SEP 

NA 

17.08 

17.04 

17.22 

17.44 

17.50 

17.70 

18.34 


MDA-RR 

16.00 

18.48 

18.64 

18.58 

18.58 

18.58 

18.58 

18.58 


MDA-RR-OS 

15.50 

17.20 

18.14 

17.98 

18.00 

17.84 

18.08 

17.98 


MDA-DR-MPCA 

14.74 

15.28 

15.78 

16.14 

16.58 

17.12 

17.62 

17.82 


MDA-DR-MPCA-MEAN 

14.74 

15.50 

15.76 

16.50 

17.00 

16.94 

17.26 

17.48 


GMM-MPCA 

15.56 

16.28 

16.06 

16.94 

17.84 

17.54 

18.58 

19.32 


GMM-MPCA-MEAN 

15.84 

16.70 

16.90 

17.28 

17.96 

18.34 

18.36 

18.84 

4 

GMM-MPCA-SEP 

NA 

16.34 

17.14 

17.56 

17.56 

18.02 

18.16 

18.16 


MDA-RR 

15.80 

18.12 

18.28 

19.06 

19.26 

19.66 

19.66 

19.66 


MDA-RR-OS 

15.50 

17.54 

18.36 

18.36 

19.34 

18.92 

18.72 

18.88 


MDA-DR-MPCA 

15.18 

15.78 

16.00 

16.36 

17.12 

17.64 

17.64 

18.26 


MDA-DR-MPCA-MEAN 

15.12 

15.86 

16.16 

16.70 

17.00 

17.56 

17.66 

18.40 


GMM-MPCA 

16.44 

16.72 

16.42 

16.96 

17.56 

17.86 

18.66 

18.52 


GMM-MPCA-MEAN 

16.26 

16.30 

17.32 

17.72 

18.04 

17.68 

18.28 

19.04 

5 

GMM-MPCA-SEP 

NA 

17.24 

16.96 

17.32 

17.40 

17.66 

17.68 

18.30 


MDA-RR 

16.76 

18.18 

18.26 

19.14 

19.16 

19.70 

19.78 

19.78 


MDA-RR-OS 

15.80 

17.78 

18.62 

19.02 

19.30 

18.92 

18.92 

18.40 


MDA-DR-MPCA 

15.34 

15.86 

15.98 

16.66 

17.16 

16.90 

17.90 

18.80 


MDA-DR-MPCA-MEAN 

15.08 

15.70 

16.76 

16.16 

17.30 

17.90 

17.56 

18.38 


the same subspace, ~ * = 1) 2, d. If they are orthogonal to each other, 

= 0, for i = 1,2, Therefore, the range of closeness{ui, U 2 ) is (0,(i). The 
higher the value, the closer the two subspaces are. 
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Table 2: Classification error rates (%) for the data with moderately high dimensions (II) 


(a) Sonar data 


Num of components 


d = 2 

d = 3 

d = 5 

d^ 7 

d = 9 

d = 11 

d = 13 

d = 15 


GMM-MPCA 


39.29 

39.78 

24.56 

25.48 

24.51 

21.61 

21.11 

21.13 


GMM-MPCA-MEAN 


35.92 

23.57 

23.54 

24.04 

23.09 

22.12 

21.63 

21.63 


GMM-MPCA-SEP 


NA 

27.85 

25.45 

24.54 

24.06 

24.55 

23.56 

24.02 

3 

MDA-RR 


36.48 

28.82 

22.08 

22.08 

22.08 

22.08 

22.08 

22.08 


MDA-RR-OS 


45.16 

25.87 

22.61 

24.05 

20.68 

23.60 

22.59 

23.58 


MDA-DR-MPCA 


42.31 

38.45 

19.71 

18.33 

19.77 

22.18 

20.71 

17.76 


MDA-DR-MPCA-MEAN 


39.43 

23.56 

18.77 

18.33 

18.83 

19.78 

21.20 

16.81 


GMM-MPCA 


40.53 

38.88 

20.19 

20.72 

18.32 

18.75 

17.33 

19.74 


GMM-MPCA-MEAN 


35.08 

25.45 

20.20 

17.83 

17.37 

18.26 

17.31 

20.71 


GMM-MPCA-SEP 


NA 

26.51 

22.62 

22.16 

20.25 

19.75 

20.71 

19.25 

4 

MDA-RR 


46.21 

27.91 

23.07 

19.27 

19.27 

19.27 

19.27 

19.27 


MDA-RR-OS 


42.80 

26.35 

26.44 

19.23 

21.62 

22.10 

19.25 

22.58 


MDA-DR-MPCA 


37.50 

37.42 

22.11 

18.33 

18.82 

21.21 

21.23 

20.26 


MDA-DR-MPCA-MEAN 


40.85 

22.11 

20.24 

19.28 

20.24 

19.76 

20.73 

19.31 


GMM-MPCA 


44.77 

39.78 

24.56 

25.48 

24.51 

21.61 

21.11 

21.13 


GMM-MPCA-MEAN 


35.42 

27.89 

21.15 

19.73 

18.78 

19.71 

18.26 

18.76 


GMM-MPCA-SEP 


NA 

32.31 

29.35 

20.21 

20.22 

20.21 

19.23 

21.17 

5 

MDA-RR 


43.70 

27.38 

25.91 

22.06 

19.67 

19.67 

19.67 

19.67 


MDA-RR-OS 


35.55 

29.34 

24.86 

22.12 

20.68 

22.19 

21.18 

20.17 


MDA-DR-MPCA 


36.05 

35.07 

21.20 

19.29 

20.73 

23.09 

20.71 

18.18 


MDA-DR-MPCA-MEAN 


38.37 

26.44 

21.21 

18.34 

23.10 

24.56 

21.64 

18.74 


(b) Imagery data 


1 Num of components 

d = 2 

d = 4 

d = 6 

d = 8 

d = 10 

d = 12 

d = 14 

d = 16 


GMM-MPCA 

55.36 

48.00 

40.36 

38.64 

38.36 

37.43 

36.07 

37.86 


GMM-MPCA-MEAN 

44.50 

36.21 

36.86 

37.07 

36.36 

36.79 

36.71 

36.14 

3 

GMM-MPCA-SEP 

NA 

NA 

35.21 

34.07 

35.57 

35.79 

35.14 

35.64 


MDA-RR 

52.57 

43.14 

40.21 

35.86 

35.86 

35.71 

35.29 

35.29 


MDA-RR-OS 

52.36 

42.50 

38.50 

34.07 

35.29 

35.50 

34.93 

34.79 


MDA-DR-MPCA 

59.93 

49.36 

42.21 

41.71 

41.00 

39.50 

37.00 

38.14 


MDA-DR-MPCA-MEAN 

49.36 

44.14 

40.86 

40.93 

38.64 

38.50 

37.79 

38.07 


GMM-MPCA 

57.00 

48.29 

39.79 

38.14 

36.57 

36.93 

35.64 

36.64 


GMM-MPCA-MEAN 

45.00 

37.00 

39.21 

36.57 

35.36 

35.43 

35.86 

36.14 

4 

GMM-MPCA-SEP 

NA 

NA 

35.00 

35.43 

35.07 

35.50 

35.43 

35.00 


MDA-RR 

52.21 

40.64 

38.93 

35.79 

37.50 

36.50 

35.29 

34.86 


MDA-RR-OS 

51.64 

43.57 

37.64 

35.50 

34.50 

32.36 

34.50 

33.64 


MDA-DR-MPCA 

59.71 

50.00 

40.14 

40.36 

38.29 

37.86 

36.29 

37.64 


MDA-DR-MPCA-MEAN 

49.71 

42.36 

39.71 

39.71 

38.64 

37.43 

37.21 

37.93 


GMM-MPCA 

57.79 

48.50 

40.36 

37.57 

37.36 

39.07 

36.07 

38.29 


GMM-MPCA-MEAN 

45.64 

36.57 

38.64 

36.14 

37.00 

36.64 

35.64 

35.36 

5 

GMM-MPCA-SEP 

NA 

NA 

35.79 

35.14 

34.43 

34.36 

35.57 

35.43 


MDA-RR 

53.21 

43.36 

39.00 

36.07 

35.86 

34.43 

33.93 

34.07 


MDA-RR-OS 

52.07 

42.57 

39.71 

34.21 

32.64 

34.21 

33.50 

32.93 


MDA-DR-MPCA 

58.50 

48.93 

39.79 

38.21 

39.57 

39.07 

36.00 

38.71 


MDA-DR-MPCA-MEAN 

50.00 

42.86 

39.21 

38.36 

39.00 

37.57 

37.64 

36.21 


In our proposed methods, a collection of constrained subspaces are obtained through MPCA 
or MPCA-MEAN at different kernel bandwidth crds, I = 1,2,and ci < a 2 < ■ ■ ■ < Crj. 
To measure the sensitivity of subspaces to different bandwidths, we compute the mean closeness 
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Table 3: Classification error rates (%) for the data with moderately high dimensions (III) 


(a) Parkinsons data 


Num of components 


d = 2 

d = 3 

d = 5 

d - 7 

d = 9 

d = 11 

d = 13 

d = 15 


GMM-MPCA 


17.96 

17.42 

14.33 

14.84 

16.98 

14.92 

15.47 

12.84 


GMM-MPCA-MEAN 


18.90 

14.84 

11.75 

13.33 

13.88 

12.88 

13.89 

12.85 


GMM-MPCA-SEP 


NA 

11.75 

13.29 

12.26 

13.34 

13.85 

11.25 

13.34 

3 

MDA-RR 


19.42 

15.96 

13.88 

13.88 

13.88 

13.88 

13.88 

13.88 


MDA-RR-OS 


16.88 

16.42 

12.31 

13.89 

12.31 

13.89 

13.37 

12.31 


MDA-DR-MPCA 


19.47 

17.90 

13.81 

14.35 

15.37 

14.83 

15.38 

16.41 


MDA-DR-MPCA-MEAN 


19.47 

17.90 

13.81 

14.33 

15.37 

15.35 

15.35 

15.35 


GMM-MPCA 


17.88 

14.77 

14.31 

14.81 

12.84 

11.30 

10.28 

12.32 


GMM-MPCA-MEAN 


14.81 

14.31 

13.83 

12.81 

9.25 

9.28 

8.74 

10.76 


GMM-MPCA-SEP 


NA 

11.28 

11.33 

12.29 

11.80 

10.29 

9.76 

9.79 

4 

MDA-RR 


16.85 

12.81 

11.79 

10.29 

9.79 

9.79 

9.79 

9.79 


MDA-RR-OS 


18.41 

15.38 

10.74 

10.79 

11.84 

11.83 

12.85 

10.32 


MDA-DR-MPCA 


19.47 

18.47 

12.29 

12.35 

10.77 

11.23 

10.72 

12.30 


MDA-DR-MPCA-MEAN 


19.47 

17.43 

12.29 

11.81 

9.72 

10.24 

10.72 

11.76 


GMM-MPCA 


19.39 

18.39 

14.84 

17.37 

15.31 

12.81 

11.25 

12.79 


GMM-MPCA-MEAN 


18.39 

16.34 

13.26 

14.83 

11.78 

11.25 

10.25 

11.29 


GMM-MPCA-SEP 


NA 

14.81 

10.75 

12.25 

11.75 

10.75 

10.25 

10.78 

5 

MDA-RR 


19.94 

16.30 

12.27 

13.83 

11.28 

10.78 

11.28 

10.79 


MDA-RR-OS 


18.96 

16.43 

14.30 

11.22 

10.25 

12.33 

9.71 

9.74 


MDA-DR-MPCA 


18.96 

18.47 

13.80 

11.81 

11.28 

12.77 

10.70 

10.76 


MDA-DR-MPCA-MEAN 


18.96 

19.52 

12.26 

11.31 

9.75 

12.27 

10.20 

9.74 


(b) Satellite data 


Num of components 

d = 2 

d = 4 

d - 7 

d = 9 

d = 11 

d = 13 

d = 15 

d = 17 


GMM-MPCA 

16.74 

15.01 

14.16 

14.67 

14.06 

13.95 

13.97 

13.63 


GMM-MPCA-MEAN 

16.94 

14.10 

13.53 

13.77 

13.95 

13.78 

13.66 

13.68 

3 

GMM-MPCA-SEP 

NA 

NA 

15.48 

13.58 

13.80 

13.58 

13.71 

13.67 


MDA-RR 

35.18 

14.41 

12.84 

12.96 

13.46 

13.60 

13.66 

13.53 


MDA-RR-OS 

34.90 

13.95 

13.01 

13.09 

12.82 

13.04 

13.35 

13.29 


MDA-DR-MPCA 

17.20 

14.83 

13.61 

13.91 

13.80 

13.35 

13.60 

13.69 


MDA-DR-MPCA-MEAN 

17.09 

14.42 

13.58 

14.12 

14.06 

13.41 

13.38 

13.13 


GMM-MPCA 

17.02 

14.13 

13.61 

13.80 

13.58 

12.90 

12.93 

12.88 


GMM-MPCA-MEAN 

17.31 

13.41 

13.50 

13.53 

13.24 

12.94 

12.91 

12.87 

4 

GMM-MPCA-SEP 

NA 

NA 

15.40 

12.93 

13.08 

13.35 

13.38 

13.54 


MDA-RR 

35.06 

13.35 

12.60 

12.77 

12.74 

12.73 

12.63 

13.05 


MDA-RR-OS 

34.28 

13.49 

11.95 

12.17 

11.90 

12.49 

11.97 

12.14 


MDA-DR-MPCA 

17.54 

14.14 

13.21 

13.52 

13.05 

12.74 

12.46 

12.45 


MDA-DR-MPCA-MEAN 

17.37 

13.36 

13.53 

13.57 

13.07 

12.43 

12.45 

12.82 


GMM-MPCA 

16.25 

13.66 

12.90 

13.29 

12.79 

12.26 

11.92 

12.24 


GMM-MPCA-MEAN 

16.77 

12.93 

12.85 

12.96 

12.40 

12.18 

11.89 

12.24 

5 

GMM-MPCA-SEP 

NA 

NA 

15.48 

13.21 

12.56 

12.70 

12.59 

12.49 


MDA-RR 

27.43 

13.27 

12.85 

12.34 

12.15 

12.18 

12.29 

12.28 


MDA-RR-OS 

30.16 

13.30 

12.31 

12.23 

11.73 

11.89 

11.98 

11.97 


MDA-DR-MPCA 

16.61 

13.80 

12.70 

12.82 

12.74 

12.09 

11.79 

12.42 


MDA-DR-MPCA-MEAN 

16.58 

12.82 

12.99 

12.93 

12.66 

11.92 

11.92 

12.31 


between the subspace found at cx; and all the other subspaces at 
1 ,2,...,/ — 1. A large mean closeness indicates that the current 
subspaces. Table lists the mean closeness of subspaces by MPCA 


preceding bandwidths a;/,/' = 
subspace is close to preceding 
and MPCA-MEAN at different 


25 














Table 4: Classification error rates (%) for the data with high dimensions 


(a) Semeion data 


Num of components 

d = 2 

d = 4 

d = 8 

d = 11 

d = 13 

d = 15 

d = 17 

d = 19 


GMM-MPCA 

53.56 

29.72 

18.27 

19.81 

18.89 

19.20 

18.89 

18.27 


GMM-MPCA-MEAN 

49.54 

29.10 

13.31 

14.86 

16.72 

14.86 

16.72 

16.10 

3 

GMM-MPCA-SEP 

NA 

NA 

NA 

13.31 

12.07 

13.00 

16.10 

14.86 


MDA-RR 

45.51 

26.93 

15.79 

14.86 

13.93 

15.79 

14.24 

13.62 


MDA-RR-OS 

48.36 

24.92 

13.93 

12.41 

10.60 

11.10 

10.59 

11.41 


MDA-DR-MPCA 

49.54 

27.86 

19.20 

17.03 

17.03 

15.79 

16.72 

16.41 


MDA-DR-MPCA-MEAN 

48.30 

26.01 

12.07 

13.62 

13.31 

12.07 

14.24 

14.55 


GMM-MPCA 

53.56 

26.32 

17.03 

16.10 

16.10 

16.10 

16.10 

15.17 


GMM-MPCA-MEAN 

51.39 

25.70 

11.46 

11.76 

12.69 

13.00 

14.24 

15.17 

4 

GMM-MPCA-SEP 

NA 

NA 

NA 

13.31 

12.07 

12.07 

13.62 

11.76 


MDA-RR 

49.23 

26.01 

13.93 

13.62 

13.00 

12.07 

13.62 

12.07 


MDA-RR-OS 

48.70 

24.83 

14.21 

11.60 

10.59 

11.16 

9.97 

11.09 


MDA-DR-MPCA 

46.75 

26.32 

17.34 

16.41 

16.41 

15.17 

16.10 

15.17 


MDA-DR-MPCA-MEAN 

44.58 

26.32 

13.00 

11.76 

15.17 

13.31 

13.00 

13.00 


GMM-MPCA 

51.70 

24.46 

15.79 

13.62 

15.17 

15.17 

13.93 

13.00 


GMM-MPCA-MEAN 

43.03 

26.63 

11.15 

11.46 

12.38 

12.07 

13.31 

13.00 

5 

GMM-MPCA-SEP 

NA 

NA 

NA 

13.00 

11.46 

13.00 

12.38 

13.00 


MDA-RR 

48.92 

25.39 

13.00 

12.69 

11.46 

10.53 

12.07 

12.69 


MDA-RR-OS 

49.16 

26.53 

14.21 

11.10 

10.60 

10.53 

9.84 

9.96 


MDA-DR-MPCA 

48.61 

27.24 

18.58 

13.93 

14.24 

13.62 

13.93 

10.84 


MDA-DR-MPCA-MEAN 

46.13 

25.08 

10.22 

10.84 

10.84 

10.53 

8.98 

9.29 


(b) YaleB data 


Num of components 

d = 2 

d = 4 

d = 6 

d = 8 

d = 10 

d = 12 

d = 14 

d = 16 


GMM-MPCA 

84.29 

64.29 

64.29 

55.71 

45.71 

38.57 

40.00 

34.29 


GMM-MPCA-MEAN 

31.43 

17.14 

52.86 

51.43 

38.57 

30.00 

28.57 

27.14 


GMM-MPCA-SEP 

NA 

NA 

27.14 

20.00 

21.43 

20.00 

20.00 

20.00 


MDA-RR 

87.14 

42.86 

27.14 

17.14 

28.57 

8.57 

11.43 

11.43 


MDA-DR-MPCA 

82.86 

58.57 

50.00 

44.29 

37.14 

42.86 

25.71 

32.86 


MDA-DR-MPCA-MEAN 

30.00 

17.14 

60.00 

37.14 

40.00 

21.43 

17.14 

14.29 


GMM-MPCA 

84.29 

67.14 

68.57 

55.71 

44.29 

44.29 

40.00 

37.14 


GMM-MPCA-MEAN 

34.29 

22.86 

64.29 

50.00 

35.71 

30.00 

35.71 

30.00 


GMM-MPCA-SEP 

NA 

NA 

31.43 

25.71 

28.57 

27.14 

24.29 

25.71 


MDA-RR 

85.71 

60.00 

41.43 

24.29 

14.29 

10.00 

12.86 

11.43 


MDA-DR-MPCA 

90.00 

55.71 

50.00 

42.86 

37.14 

41.43 

28.57 

27.14 


MDA-DR-MPCA-MEAN 

25.71 

21.43 

60.00 

35.71 

32.86 

11.43 

11.43 

12.86 


GMM-MPCA 

85.71 

65.71 

65.71 

55.71 

50.00 

45.71 

42.86 

40.00 


GMM-MPCA-MEAN 

37.14 

14.29 

60.00 

51.43 

47.14 

42.86 

41.43 

38.57 


GMM-MPCA-SEP 

NA 

NA 

31.43 

35.71 

30.00 

32.86 

28.57 

35.71 


MDA-RR 

85.71 

61.43 

42.86 

42.86 

32.86 

30.00 

22.86 

24.29 


MDA-DR-MPCA 

87.14 

67.14 

52.86 

38.57 

34.29 

34.29 

17.14 

22.86 


MDA-DR-MPCA-MEAN 

27.14 

18.57 

50.00 

34.29 

27.14 

21.43 

20.00 

7.14 


bandwidth levels for the sonar and imagery data (the training set from one fold in the previous 
hve-fold cross validation setup). The hrst values in the parentheses are by MPCA while the 
second values are by MPCA-MEAN. We vary the dimension of the constrained subspace. The 
number of modes identihed at each level is also shown in the tables. As Table shows, for both 
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methods, the subspaces found at the hrst few levels are close to each other, indicated by their large 
mean closeness values, which are close to d, the dimension of the subspace. As the bandwidth ai 
increases, the mean closeness starts to decline, which indicates that the corresponding subspace 
changes. When ai is small, the number of modes identihed by HMAC is large. The modes and their 
associated weights do not change much. As a result, the generated subspaces at these bandwidths 
are relatively stable. As ai increases, the kernel density estimate becomes smoother, and more 
data points tend to ascend to the same mode. We thus have a smaller number of modes with 
changing weights, which may yield a substantially different subspace. Additionally, the subspace 
by MPCA-MEAN is spanned by applying weighted PCA to a union set of modes and class means. 
In our experiment, we have allocated a larger weight proportionally to class means (in total, 60%) 
and the class means remain unchanged in the union set at each kernel bandwidth. Therefore, the 
differences between subspaces by MPCA-MEAN are smaller than that by MPCA, indicated by 
larger closeness values. 

5.4 Model Selection 

In our proposed method, the following model selection strategy is adopted. We take a sequence 
of subspaces resulting from different kernel bandwidths, and then estimate a mixture model con¬ 
strained by each subspace and hnally choose a model yielding the maximum likelihood. In this 
section, we examine our model selection criteria, and the relationships among test classihcation 
error rates, training likelihoods and kernel bandwidths. 

Figure [T] shows the test classihcation error rates at different levels of kernel bandwidth for 
several data sets (from one fold in the previous hve-fold cross validation setup), when the number 
of mixture components for each class is set to three. The error rates are close to each other at 
the hrst few levels. As the kernel bandwidth increases, the error rates start to change. Except for 
the waveform, on which the error rates of GMM-MPCA and GMM-MPCA-MEAN are very close, 
for the other data sets in Figure the error rate of GMM-MPGA-MEAN at each bandwidth 
level is lower than that of GMM-MPGA. Similarly, at each kernel bandwidth level, the error rate 
of GMM-MPGA-SEP is also lower than that of GMM-MPGA, except for the robot data. We 
also show the training log-likelihoods of these methods with respect to different kernel bandwidth 
levels in Figure The training log-likelihoods are also stable at the hrst few levels and start to 
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fluctuate as the bandwidth increases. This is due to the possible big change in subspaces under 
large kernel bandwidths. 

In our model selection strategy, the subspace which results in the maximum log likelihood of 
the training model is selected and then we apply the model under the constraint of that specihc 
subspace to classify the test data. In Figure the test error rate of the model which has the 
largest training likelihood is indicated by an arrow. As we can see, for each method, this error rate 
is mostly ranked in the middle among all the error rates at different levels of bandwidth, which 
indicate that our model selection strategy helps hnd a reasonable training model. 

Table 5: Mean closeness of subspaces by MPCA and MPCA-MEAN at different levels of kernel bandwidth 


(a) Sonar data 


Bandwidth level 

2 

4 

6 

8 

10 

12 

14 

Num of modes 

158 

144 

114 

86 

60 

15 

8 

d = 2 

{2.00, 2.00) 

(2.00, 2.00) 

(2.00, 2.00) 

(1.99, 1.99) 

(1.98, 1.98) 

(1.77, 1.88) 

(1.43, 1.80) 

d = 4: 

(4.00, 4.00) 

(4.00, 4.00) 

(3.99, 4.00) 

(3.98, 3.99) 

(3.84, 3.94) 

(2.63, 3.08) 

(2.09, 2.96) 

d = 6 

(6.00, 6.00) 

(5.99, 5.99) 

(5.95, 5.99) 

(5.91, 5.88) 

(5.41, 5.47) 

(4.48, 4.64) 

(3.48, 4.13) 

d = 8 

(8.00, 8.00) 

(8.00, 8.00) 

(7.97, 7.97) 

(7.86, 7.89) 

(6.98, 7.00) 

(6.20, 6.40) 

(4.29, 5.18) 


(b) Imagery data 


Bandwidth level 

2 

4 

6 

8 

10 

12 

14 

Num of modes 

1109 

746 

343 

144 

60 

35 

14 

d = 6 

(6.00, 6.00) 

(5.97, 5.99) 

(5.32, 5.86) 

(5.34, 5.61) 

(5.21, 5.32) 

(5.02, 5.32) 

(3.63, 5.15) 

d = 8 

(8.00, 8.000) 

(7.96, 7.96) 

(7.54, 7.89) 

(7.31, 7.76) 

(6.82, 7.43) 

(6.27, 6.85) 

(4.83, 6.61) 

d = 10 

(10.00, 10.00) 

(9.80, 9.52) 

(9.56, 9.48) 

(9.25, 9.23) 

(8.45, 9.01) 

(7.71, 8.42) 

(5.86, 7.55) 

d = 12 

(12.00, 12.00) 

(11.96, 11.96) 

(11.48, 11.50) 

(10.29, 10.89) 

(10.06, 10.33) 

(9.49, 9.87) 

(7.42, 8.98) 


5.5 Clustering 

We present the clustering results of GMM-MPCA and MDA-RR on several simulated data 
sets and visualize the results in a low-dimensional subspace. The previous model selection criteria 
is also used in clustering. After htting a subspace constrained Gaussian mixture model, all the 
data points having the highest posterior probability belonging to a particular component form one 
cluster. We outline the data simulation process as follows. 

The data is generated from some existing subspace constrained model. Specihcally, we take 
the training set of the imagery data from one fold in the previous hve-fold cross validation setup 
and estimate its distribution by htting a mixture model via GMM-MPGA. We will obtain hve 
estimated component means which are ensured to be constrained in a two dimensional subspace. 
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Figure 1: The test classification error rates at different levels of kernel bandwidth 


A set of 200 samples are randomly drawn from a mnltivariate Ganssian distribntion with the 
previonsly estimated component means as the sample means. A common identity covariance is 
assnmed for the Ganssian mnltivariate distributions. We generate hve sets of samples in this way, 
forming a data set of 1000 samples. We scale the component means by different factors so that 
the data have different levels of dispersion among the clusters. The lower the dispersion, the more 
difficult the clustering task. Specihcally, the scaling factor is set to be 0.125, 0.150, and 0.250, 
respectively, generating three simulated data with low, middle and high level dispersion between 
clusters. 

Figure shows the clustering results of three simulated data sets by GMM-MPGA and MDM- 
RR, in two-dimensional plots, color-coding the clusters. The data projected onto the true dis¬ 
criminant subspace with true cluster labels are shown in Figure [^a). In addition. Figure [^b) 
and Figure |^c) show the data projected onto the two-dimensional discriminant subspaces by 
GMM-MPGA and MDA-RR. For all the simulated data sets, both GMM-MPGA and MDA-RR 
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Figure 2: Training log-likelihoods at different kernel bandwidth levels 


can effectively reveal the clustering structure in a low-dimensional subspace. To evaluate their 
clustering performance, we compute the clustering accuracy by comparing their predicted and true 
clustering labels. Suppose the true cluster label of data point Xi is and the predicted cluster 
label is pj, the clustering error rate is calculated as 1 — TT^o,p{pi))/n, where n is the total 

number of data points, I{x, y) is an indicator function that is equal to one if x = y otherwise zero, 
and map{pi) is a permutation function which maps the predicted label to an equivalent label in the 
data set. Specifically, we use the Kuhn-Munkres algorithm to find the best matching ( |Lovasz and 
Plummer, 1986| . The clustering error rates are listed in the titles above the plots in Figure]^ The 
mis-clustered data points are in gray. When the dispersion between clusters is low or middle, the 
clustering error rates of GMM-MPCA are smaller than those of MDA-RR. When the dispersion 
is high, the task becomes relatively easy and the clustering accuracy of these two methods are 
the same. In Table we also show the closeness between the true discriminant subspace and the 
discriminant subspaces found by GMM-MPCA and MDA-RR. Comparing with MDA-RR, for all 
the three data sets, the closeness between the subspace by GMM-MPCA and the true subspace 
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Simulation data I in true discriminant subspace 



-12 -10 -8 -6 -4 -2 0 2 4 6 8 


Simulation data II in true discriminant subspace 



Cb 


-20 -15 -10 -5 0 5 10 


Simulation data III in true discriminant subspace 




GMM-MPCA 2D clustering on simulation data III (error rate =0.60%) 



MDA-RR 2D clustering on simulation data III (error rate =0.60%) 



-30 -25 -20 -15 


(a) 


(b) 


(c) 


Figure 3: Two-dimensional plot for the clustering of synthetic data, color-coding the clusters, (a) Projec¬ 
tions onto the two-dimensional true discriminant subspace, with true cluster labels, (b), (c) Projections 
onto the two-dimensional discriminant subspace by GMM-MPCA and MDA-RR respectively, with pre¬ 
dicted cluster labels. 


are smaller. 

6 Conclusion 

In this paper, we propose a Gaussian mixture model with the component means constrained in 
a pre-selected subspace. We prove that the modes, the component means of a Gaussian mixture. 
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Table 6; Closeness between subspaces in clustering with different dispersions 


Closeness 

low 

middle 

high 

GMM-MPCA 

1.769 

1.820 

1.881 

MDA-RR 

1.552 

1.760 

1.866 


and the class means all lie in the same constrained subspace. Several approaches to hnding the 
subspace are proposed by applying weighted PCA to the modes, class means, or a union set of 
modes and class means. The constrained method results in a dimension reduction property, which 
allows us to view the classihcation or clustering structure of the data in a lower dimensional 
space. An EM-type algorithm is derived to estimate the model, given any constrained subspace. 
In addition, the Gaussian mixture model with the component means constrained by separate 
parallel subspace for each class is investigated. Although reduced rank MDA is a competitive 
classihcation method by constraining the class means to an optimal discriminant subspace within 
each EM iteration, experiments on several real data sets of moderate to high dimensions show 
that when the dimension of the discriminant subspace is very low, it is often outperformed by our 
proposed method with a simple technique of spanning the constrained subspace using only class 
means. 

We select the constrained subspace which has the largest training likelihood among a sequence 
of subspaces resulting from different kernel bandwidths. If the number of candidate subspaces is 
large, it may be desired to narrow down the search by incorporating some prior knowledge. For 
instance, the proposed method may have a potential in visualization when users already know 
that only a certain dimensions of the data matter for classihcation or clustering, i.e., a constrained 
subspace can be obtained beforehand. Finally, we expect this subspace constrained method can 
be extended to other parametric mixtures, for instance, mixture of Poisson for discrete data. 
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SUPPLEMENTAL MATERIALS LIST 


Appendix to Gaussian Mixture Models with Component Means Constrained in Pre¬ 
selected Subspaces: In Appendix A, we prove Theorem 3.1. The proof of Theorem 3.2 is 
provided in Appendix B. The derivation of of Ukr in GEM is in Appendix C and reduced rank 
mixture discriminant analysis is in Appendix D. 
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SUPPLEMENTAL MATERIALS 


Appendix A 

We prove Theorem 3.1. Consider a mixture of Gaussians with a common covariance matrix S 
shared across all the components as in (2): 

R 

f{X = x) = Y^ nr(i){x\fir, S) . 

r=l 

Once S is identihed, a linear transform (a “whitening” operation) can be applied to X so that the 
transformed data follow a mixture with component-wise diagonal covariance, more specihcally, 
the identity matrix I. Assume S is non-singular and hence positive dehnite, we can hnd the 
matrix square root of S, that is, S = (I]2)*S2. If the eigen decomposition of S is S = 
then, S 2 = D^V^. Let W = and Z = WX. The density of Z is g{Z = z) = 

I). Any mode of g(z) corresponds to a mode of f(x) and vice versa. Hence, 
without loss of generality, we can assume S = /. 

Another linear transform on Z can be performed using the orthonormal basis V = ly U = 
{t^i,..., where ly = ..., 1 ;^} is the constrained subspace where reside, and ly-^- = 

{t^i, ...,Vg} is the corresponding null subspace, as dehned in Section 3. Suppose Z = Projy. For 
the transformed data z, the covariance matrix is still I. Again, there is a one-to-one correspon¬ 
dence (via the orthonormal linear transform) between the modes of gk{z) and the modes of gk{z). 
The density of z is 

R 

g[Z = z) = Y^ nr(i){z\0r, I) , 

r=l 

where 6 r is the projection of W onto the orthonormal basis V, i.e., = Proj^^’’. Split 6 r 

into two parts, 6 r^i being the hrst q dimensions of 6 j. and 6 ^ 2 being the last p — q dimensions. 
Since the projections of /x^’s onto the null subspace ly^ are the same, are identical for all the 
components, which is hence denoted by 6 . 1 . Also denote the hrst q dimensions of z by z^^\ and 
the last p — q dimensions by z^'^\ We can write g(z) as 

R 

g(Z = z) = 5^7r,0(2«|0.,i,/,)0(2(2)|0,,2, V,) . 

r=l 
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where Iq indicates a. q x q identity matrix. Since g{z) is a smooth function, its modes have zero 
hrst order derivatives. Note 


dg{z) d(j){z^^^\e.^ulq) 


R 


dz 


( 1 ) 


dz 


( 1 ) 


y^^7lr4>{z^^\6r,2,I: 


p-qj ? 


r=l 


dg{z) 


dz 


R 




( 2 ) 


TT^ 


d(j){z^^^\er,2,Ip-,j 


r=l 


dz 


( 2 ) 


By setting the hrst partial derivative to zero and using the fact |0,.,2,-fp-g) > 0, we 

get 

f)x(zd)\a . i , 

= 0 , 


d2)i 






and equivalently 

2*-^^ = 6.1 , the only mode of a Gaussian density. 

For any modes of g{z), the hrst part 2*-^^ all equal to O. i, that is, the projections of the modes 
onto the null subspace coincide at 6 . i. Hence the modes and component means lie in the same 
constrained subspace u. 


Appendix B 


We prove Theorem 3.2 here. Assume ix = {vq^i, ...,Vp} is the constrained subspace where 
reside, and = {vi,...,Vq} is the corresponding null subspace, as dehned in Section 3. 
We use the Bayes classihcation rule to classify a sample x: y = aigmaxi^fiY = /c|X = x) = 
argmax^ /(X = x, F = k). 

Rk 

f{X = x,Y = k) = ttkfkix) oc ak'^'7Tkrexp{-{x - . (11) 

r=l 


Let W 



. Matrix V is orthonormal because Vj^s are orthonormal by construction. Consider 


the following cases of S. 
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B.l E is an identity matrix 


From Eq. (11), we have 


Rk 


J2'^krexp{-{x - 

r=l 

Rk 

7rfc,exp(-(a; - 

r=l 

Rk 

Y^krexp{-{Vx - v^i,,yyvx - V^l^,y) 

r=l 

Rk p 

^7rfcrexp(-y)(5y - fikrjf) > 


( 12 ) 


r=l 


j=l 


where Xj = v^- ■ x, = v^j ■ ^^kr) 3 — 1)2, ...,p. Because flkrj = Cj, identical across all k and r 


for j = 1, • • • , g, the hrst q terms in the sum of exponent in Eq. (12) are all constants. We have 

Rk p 

Y ^kr exp(- - f^kr,jY) 

r=l j=l 

Rk P 


OC 


^7rfc^exp(- Y - hr,jT) ■ 

j=q+l 


r=l 


Therefore, 


Rk p 

f{X = X,Y = k) (X UkY ^kr exp(- Y hj “ hr,jf) ■ 

r=l j=qy-l 

That is, to classify a sample x, we only need the projection of x onto the constrained subspace 

= {ni,...,nj. 


B.2 E is a non-identity matrix 


We can perform a linear transform (a “whitening” operation) on X so that the transformed 
data have an identity covariance matrix/. Find the matrix square root of S, that is, S = (S 2 )*S 2 . 
If the eigen decomposition of S is S = VsDsV^, then S 2 = D^V-^. Let Z = (S~2)*x. The 
distribution of Z is 


Rk 

g{Z = Z,Y = k) = akY ^fer-0(2|Afcr) I) ) 

r=l 
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where = (S According to our assumption, u* ■ = cj, i.e., identical across all k 

and r for j = Plugging into = (S^)*Afcr, we get = Cj, j = 

This means for the transformed data, the component means Afcr’® have a null space spanned by 
{'E^VjlJ = Correspondingly, the constrained subspace is spanned by {{'E~^yvj\j = 

q + 1, It is easy to verify that the new null space and constrained subspace are orthogonal, 

since CS^VjY ■ {Yi~^yvji = ■ Vj/ = 0, j = l,...q and j' = q + The spanning vectors 

for the constrained subspace, j = q + 1, ...,p, are not orthonormal in general, but there 

exists an orthonormal basis that spans the same subspace. With a slight abuse of notation, we 
use = q + l,...,p} to denote a p x {p — q) matrix containing the column vector 

For any matrix A of dimension p x d, d < p, let the notation orth{A) denote a p x d 
matrix whose column vectors are orthonormal and span the same subspace as the column vectors 
of A. According to for the transformed data Z, we only need the projection of Z onto a 
subspace spanned by the column vectors of orfh({(S~ 2 )*'y^.|j = q + l,...,p}) to compute the 
class posterior. Note that Z = So the subspace that matters for classihcation for the 

original data X is spanned by the column vectors of (S 5) X orfh({(S = q+l,...,p}). 

Again, these column vectors are not orthonormal in general, but there exists an orthonormal basis 
that spans the same subspace. This orthonormal basis is hence spanned by the column vectors 
of orth{{'E~ 2 ) X orth{{{'E~ 2 yvj\j = q + l,...,p})). Since orth{{'E~ 2 ) x orth{{{'E~^yVj\j = 
q + l, ...,p})) = orth({Yl~^Vj\j = q + 1 , ...,p})|^the subspace that matters for classihcation is thus 
spanned by the column vectors of orth{{'E~^Vj\j = q + 1, ...,p}). 

In summary, only the linear projection of the data onto a subspace with the same dimension 
as IX matters for classihcation. 

Appendix C: Derivation of jikr in GEM 

We derive the optimal aa^^’s under constraint (4) for a given S. Note that the term in Eq. (6) 
that involves At^^’s is: 

A Rk 

- 2 . (13) 

k=l r=l i=l 

^ Let matrix Ahe a pxp square matrix and B be a p x d matrix, d < p, it can be proved that orth{A x orth{B)) = 
orth{A X B). 
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Denote Ym=i %kr by 4r- Let Qi,kr^i/^kr, be., the weighted sample mean of the compo¬ 


nent r in class k. To maximize Eq. (13) is equivalent to minimizing the following term (Anderson 


2000 ): 


K Rk 

'y ^ ^ ^kri^kr ~ l^kr) ^ (*fcr ~ l^kr) ■ 

k=l r=l 


(14) 


To solve the above optimization problem under constraint (|^, we need to hnd a linear transform 
such that in the transformed space, the constraint is imposed on individual coordinates (rather 
than linear combinations of them), and the objective function is a weighted sum of squared Eu¬ 
clidean distances between the transformed Xkr and Once this is achieved, the optimal solution 
will simply be given by setting those unconstrained coordinates within each component by the 
component-wise sample mean, and the constrained coordinates by the component-pooled sample 
mean. We will discuss the detailed solution in the following. 

Find the matrix square root of S, that is, S = (S 2 )^X 2 . If the eigen decomposition of S is 
S = then, Ss = D^V^. Now perform the following change of variables: 

K Rk 

'y ^ ^ ^kri^kr ~ l^kr) ^ (*A:r ~ l^kr) 

k=l r=l 

K Rk r . , . It r 


S 2 j (ajfcr f^kr) 


k=l r=l 
K Rk 

'y ^ ^ ^kri^kr ~ f^kr) {^kr ~ P'kr) 5 


S 2 j (^Xkr f^kr) 


(15) 


k=l r=l 


where = ( ^ ^ ) 'k^'kn x^^ = 2 j -Xkr- Correspondingly, the constraint in (4) becomes 

= constant over r and k , j = 1 , •••, q ■ (16) 

Let bj = Ti^Vj and B = (hi, 62 , •••, bq). Note that the rank of V = (r^i, ■■■,Vq) is q. Since S 2 is of 


full rank, B = 1] 2 V also has rank q. The constraint in (16) becomes 


, for any r, r' = 1,..., Rk, and any k, k' = 1,K . 


(17) 


Now perform a singular value decomposition (SVD) on B, i.e., B = UbDbVb^, where Vb is a 
q X q orthonormal matrix, Db is a q x q diagonal matrix, which is non-singular since the rank of 
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B is g, and Ub is a p x q orthonormal matrix. Substituting the SVD of B in (IT^, we get 


VbDbUb" pf-kr = VbDbUb^ pf-k'r'i Rk^ and any k, k' = 1,..., K , 


which is equivalent to 


UB^/J-kr = UB^pi-k'r'i ■i ^k, and any k, k' = 1,..., K, 


(18) 


because Vb and Db have full rank. We can augment Ub to a p x p orthonormal matrix, 
U = {ui, ...,Ug,Ug+i, ...,Up), where Wg+i, ..., Up are augmented orthonormal vectors. Since IJ 


is orthonormal, the objective function in Eq. (15) can be written as 
K Rk K Rk 


^ ^ ^ ^ ^kr[U i^kr P'j)] ' {^kr P'kr)] ~ ^ ^ ^ ^ lkr{,^kr f^kr) i^kr f^kr) i (^ 9 ) 


fe=l r=l 


k=l r=l 


where Xkr = U Xkr and = U pL^^. If we denote = {pkr,i, Pkrp, ■■■, Pkr,py, then the con¬ 


straint in (18) simply becomes 


Pkrj = Pk'r'j , for auy r, r' = 1,..., Rk, and any k, k' = 1,..., K, j = 1, ...,q . 

That is, the hrst q coordinates of fi have to be common over all the k and r. The objective 


function (19) can be separated coordinate wise: 


K Rk 

EE' 


p K Rk 


kri^^j - RkrYiS^kr - Rkr) = EEE Ikr {Xkr,j Pkr,j ) 


k=l r=l j=l A:=l r=l 

For the hrst q coordinates, the optimal pkrj, j = 1, •••, <?, is solved by 

.. ^ Ek'=iEXiJk'r'Xk'r',j ^ ELiEXih'Ar',j r and . 

Z^k'=l Z^r'=l ‘'k'r' 

For the remaining coordinates, Pkrj, j = <? + 1, 

W 5); W 

Pkr,j ^kr,j ■ 

After jll^ is calculated, we hnally get under the constraint (|^: 
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Appendix D: Reduced Rank Mixture Discriminant Analysis 

The rank restriction can be incorporated into the mixture discriminant analysis (MDA). It is 
known that the rank-L LDA fit is equivalent to a Gaussian maximum likelihood solution, where 


the means of Gaussians lie in a L-dimension subspace (Hastie and Tibshirani, 1996). Similarly, in 
MDA, the log-likelihood can be maximized with the restriction that all the R = 'Yl!k=i centroids 
are conhned to a rank-L subspace, i.e., rank {fikr} = L. 

The EM algorithm is used to estimate the parameters of the reduced rank MDA, and the M- 
step is a weighted version of LDA, with R “classes”. The component posterior probabilities Qi^kr^s 
in the E-step are calculated in the same way as in Eq. ([^, which are conditional on the current 
(reduced rank) version of component means and common covariance matrix. In the M-step, vr^^’s 
are still maximized using Eq. ([^. The maximizations of and S can be viewed as weighted 
mean and pooled covariance maximum likelihood estimates in a weighted and augmented i?-class 
problem. Specihcally, we augment the data by replicating the Uk observations in class k Rk times, 
with the Ith such replication having the observation weight qi^ki- This is done for each of the K 
classes, resulting in an augmented and weighted training set of Ylik=i ^kRk observations. Note 
that the sum of all the weights is n. We now impose the rank restriction. For all the sample 
points XiS within class k, the weighted component mean is 

E rin 

i=l Qi^kr^i 






kr 


Let q'kr = Er=i QiM- The overall mean is 


Ef=l E"=l QkrRh. 


Rk 


R = 


Efc=l Er=l ^'kr 

The pooled within-class covariance matrix is 

Ef=i ESi Er=i - RkT)\xi - 


w = 

The between-class covariance matrix is 


Ef=i Er=i (I'h. 


Rk 


B = 


Efc=l Er=l (lkri^^kr “ ^YiRkr “ R) 


E K / 

k=l jL-^r=l ^k 


Rk 


Dehne B* = {W ^yBW 2 . Now perform an eigen-decomposition on B*, i.e., B* = V*DbV* , 
where V* = (vyvy...,v*). Let E be a matrix consisting of the leading L columns of W~^V*. 
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Considering maximizing the Ganssian log-likelihood snbject to the constraints rank 
the solntions are 




(20) 


E = hh 


Efc=l Er=l ^kri^^kr “ ^^krYi^^kr “ Afcr 


( 21 ) 


sr^Rk „/ 
k=l 2-^r=l ^kr 

As a snmmary, in the M-step of rednced rank MDA, the parameters, vr^^, and S, are 
maximized by Eqs. 0, (@, and d^ , respectively. 

Note that the discriminant snbspace is spanned by the colnmn vectors of C = W~^V*, with 
the Zth discriminant variable as W~^Vi. In general, W~^Vi^s are not orthogonal, bnt we can End 
an orthonormal basis that spans the same snbspace. 
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